The goal of this project is to train a classification algorithm on
anonymized insurance contract data in order to predict the churn of
insurance customers (i.e., whether the insurance company will lose a
customer or not). Since the outcome is categorical (a customer stays or
leaves) and the churn in the available training data is labeled, we face
a supervised learning problem that can be solved with a classification
model.
In order to measure the performance of the different classification
models, the F1-Score will serve as a metric.
A potential use case where these insights could be applied: any insurance company wants to improve (or at least maintain) its customer loyalty. If the insurance case handlers could understand why a customer quits a contract and, more importantly, whether a customer is close to quitting an ongoing contract, they could take countermeasures to improve customer service and possibly keep the customer. The output of this model could be provided as information within a dashboard, or integrated into existing applications which display contract information to the case handler. The case handlers can then use this information to make further decisions. The visualization in this project will be realized through a dashboard.
To measure our model's performance, we will use the specificity (true negatives / (true negatives + false positives)) and visualize it with a confusion matrix. This is the most important metric since we want to measure how well our model predicts true negatives (churn) and minimizes false positives (the model predicting no churn when churn would be correct) in order to give the case handlers the best possible overview of churn candidates. We will train different models and compare their performance based on specificity and the confusion matrix.
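As a reminder of how these quantities relate, specificity and the confusion matrix can both be computed with yardstick; a minimal sketch on made-up toy predictions (the numbers are purely illustrative):

```r
library(yardstick)
library(tibble)

# Hypothetical toy predictions; level 0 = no churn, level 1 = churn
example <- tibble(
  truth    = factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1)),
  estimate = factor(c(0, 1, 1, 0, 1, 0), levels = c(0, 1))
)

conf_mat(example, truth, estimate)  # 2x2 confusion matrix
spec(example, truth, estimate)      # TN / (TN + FP)
```

Note that yardstick treats the first factor level (here 0, no churn) as the event by default, so specificity here measures how well the churners (level 1) are recognized.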
The model is successful when:
Load the training and test datasets (provided by Kaggle - see https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon)
Note: the training dataset will be used to fit the model. The provided test dataset will be treated as new data to simulate the model's performance, since it is unlabeled.
LINK_train <-
"https://raw.githubusercontent.com/NicoHenzel/Insurance-Churn-Prediction/main/Data/Train.csv"
new_training_data <-
read_csv(LINK_train, show_col_types = FALSE)
LINK_new <-
"https://raw.githubusercontent.com/NicoHenzel/Insurance-Churn-Prediction/main/Data/Test.csv"
new_data <-
read_csv(LINK_new, show_col_types = FALSE)
Look at the first rows from each dataframe.
| Anonymized contract churn data | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Training set | ||||||||||||||||
| feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 | feature_12 | feature_13 | feature_14 | feature_15 | labels |
| -0.2765146 | -0.4244288 | 1.3449969 | -0.01228261 | 0.07622994 | 1.0766475 | 0.1821975 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 10 | 2 | 1 |
| 0.8535731 | 0.1509913 | 0.5038918 | -0.97917917 | -0.56935064 | -0.4114531 | -0.2519404 | 4 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 3 | 0 |
| 0.9477471 | -0.1738321 | 1.8256285 | -0.70347774 | 0.07622994 | -0.4114531 | -0.2519404 | 6 | 1 | 2 | 0 | 0 | 0 | 0 | 5 | 3 | 0 |
| 0.8535731 | -0.3814037 | 0.9845233 | -0.03946444 | -0.56935064 | -0.4114531 | -0.2519404 | 4 | 0 | 2 | 0 | 1 | 0 | 0 | 5 | 3 | 0 |
| 1.3244430 | 1.5905268 | -1.1783185 | -0.09771123 | -0.24656035 | -0.4114531 | -0.2519404 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 8 | 3 | 0 |
| Source: Kaggle - Customer Churn Prediction - Weekend Hackathon | ||||||||||||||||
| https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon | ||||||||||||||||
new_data %>%
slice_head(n=5) %>%
gt() %>%
tab_header(
title = md("**Anonymized contract churn data**"),
subtitle = md("Unlabeled dataset")
) %>%
tab_source_note(
source_note = "Source: Kaggle - Customer Churn Prediction - Weekend Hackathon"
) %>%
tab_source_note(
source_note = "https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon"
)
| Anonymized contract churn data | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Unlabeled dataset | |||||||||||||||
| feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 | feature_11 | feature_12 | feature_13 | feature_14 | feature_15 |
| 0.5710512 | 0.4068430 | 0.9845233 | 0.0110161 | -0.56935064 | -0.4114531 | -0.2519404 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 11 | 3 |
| -1.1240804 | -0.1669349 | 0.5038918 | -0.3229321 | 0.72181052 | 0.5473231 | 0.1821975 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 5 | 1 |
| 0.4768772 | 0.1450794 | -0.5775291 | -0.6918284 | -0.24656035 | -0.4114531 | -0.2519404 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 3 |
| 1.6069650 | -0.4474193 | 1.8256285 | -0.9830623 | 7.17761632 | -0.4114531 | -0.2519404 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 5 | 3 |
| -0.9357324 | -0.3646534 | -1.1783185 | -0.3229321 | 0.07622994 | -0.4114531 | -0.2519404 | 8 | 2 | 1 | 0 | 1 | 0 | 2 | 8 | 3 |
| Source: Kaggle - Customer Churn Prediction - Weekend Hackathon | |||||||||||||||
| https://www.kaggle.com/datasets/k123vinod/insurance-churn-prediction-weekend-hackathon | |||||||||||||||
This gives us the following information:
The data seems to be clean already. Further information will be gathered in the Format Data section.
The datasets are divided by a 75/25 split. The dimensions for both dataframes are the following:
Training set:
observations = 33908
variables = 17
New dataset:
observations = 11303
variables = 16
The datasets are already clean (lowercase column names without spaces or special characters). This makes things easier for us, since no data cleaning needs to be performed.
Next we will be inspecting the given data structure and check if there are any missing values in both datasets.
glimpse(new_training_data)
## Rows: 33,908
## Columns: 17
## $ feature_0 <dbl> -0.276514595, 0.853573137, 0.947747115, 0.853573137, 1.3244~
## $ feature_1 <dbl> -0.42442881, 0.15099126, -0.17383206, -0.38140368, 1.590526~
## $ feature_2 <dbl> 1.34499695, 0.50389181, 1.82562845, 0.98452332, -1.17831846~
## $ feature_3 <dbl> -0.012282614, -0.979179166, -0.703477740, -0.039464445, -0.~
## $ feature_4 <dbl> 0.07622994, -0.56935064, 0.07622994, -0.56935064, -0.246560~
## $ feature_5 <dbl> 1.0766475, -0.4114531, -0.4114531, -0.4114531, -0.4114531, ~
## $ feature_6 <dbl> 0.1821975, -0.2519404, -0.2519404, -0.2519404, -0.2519404, ~
## $ feature_7 <dbl> 3, 4, 6, 4, 0, 4, 9, 9, 8, 7, 1, 1, 11, 10, 9, 7, 6, 9, 8, ~
## $ feature_8 <dbl> 0, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 0, 2, 2, 2, 2, 2,~
## $ feature_9 <dbl> 1, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 2, 1, 1, 1,~
## $ feature_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_11 <dbl> 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1,~
## $ feature_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_13 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 0, 0, 2, 0, 0, 0,~
## $ feature_14 <dbl> 10, 0, 5, 5, 8, 10, 5, 5, 8, 8, 6, 5, 5, 8, 1, 0, 6, 0, 8, ~
## $ feature_15 <dbl> 2, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3,~
## $ labels <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,~
vis_dat(new_training_data)
glimpse(new_data)
## Rows: 11,303
## Columns: 16
## $ feature_0 <dbl> 0.57105120, -1.12408039, 0.47687723, 1.60696496, -0.9357324~
## $ feature_1 <dbl> 0.40684299, -0.16693490, 0.14507941, -0.44741934, -0.364653~
## $ feature_2 <dbl> 0.9845233, 0.5038918, -0.5775291, 1.8256285, -1.1783185, 0.~
## $ feature_3 <dbl> 0.01101610, -0.32293211, -0.69182838, -0.98306229, -0.32293~
## $ feature_4 <dbl> -0.56935064, 0.72181052, -0.24656035, 7.17761632, 0.0762299~
## $ feature_5 <dbl> -0.4114531, 0.5473231, -0.4114531, -0.4114531, -0.4114531, ~
## $ feature_6 <dbl> -0.2519404, 0.1821975, -0.2519404, -0.2519404, -0.2519404, ~
## $ feature_7 <dbl> 0, 0, 0, 1, 8, 1, 0, 4, 4, 10, 4, 0, 8, 1, 10, 3, 10, 4, 9,~
## $ feature_8 <dbl> 1, 2, 1, 1, 2, 2, 2, 0, 1, 1, 2, 0, 2, 1, 1, 0, 2, 2, 0, 1,~
## $ feature_9 <dbl> 1, 1, 1, 0, 1, 3, 1, 1, 2, 1, 2, 1, 1, 1, 0, 1, 1, 2, 2, 2,~
## $ feature_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_11 <dbl> 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,~
## $ feature_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ feature_13 <dbl> 0, 0, 0, 0, 2, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 2,~
## $ feature_14 <dbl> 11, 5, 1, 5, 8, 6, 3, 8, 5, 0, 8, 8, 8, 8, 3, 9, 1, 1, 9, 6~
## $ feature_15 <dbl> 3, 1, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3,~
vis_dat(new_data)
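Besides the vis_dat() overview, missing values can also be counted explicitly; a small sketch, assuming both dataframes are loaded as above:

```r
# Count missing values per column in each dataset
colSums(is.na(new_training_data))
colSums(is.na(new_data))

# Or as a single total per dataset
sum(is.na(new_training_data))
sum(is.na(new_data))
```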
We can see that:
| labels | n |
|---|---|
| 0 | 29941 |
| 1 | 3967 |
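The label counts above can be produced with a simple count (a sketch, assuming dplyr is loaded):

```r
library(dplyr)

# Count the observations per label (0 = no churn, 1 = churn)
new_training_data %>%
  count(labels)
```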
First, the labels column will be renamed to churn. The column is then converted to a factor, which is required for running the classification algorithm.
# Rename labels
df_train <-
new_training_data %>%
rename(churn = labels) %>%
# Change column type to factor
mutate(
churn = as.factor(churn)
)
This way, the churn is more accurately represented by the churn variable:
| Insurance customers churn rate | ||
|---|---|---|
| Anonymized training sample | ||
| Churn | Amount | Percent |
| 0 | 29941 | 88.3 |
| 1 | 3967 | 11.7 |
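A sketch of how such a churn-rate summary could be computed with dplyr (the gt styling is omitted):

```r
library(dplyr)

# Absolute and relative frequency of the churn variable
df_train %>%
  count(churn, name = "amount") %>%
  mutate(percent = round(100 * amount / sum(amount), 1))
```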
Note:
* No new variables will be created, since we work with anonymized data.
* The Data exploration section covers the analysis and interpretation of the relations between the variables.
skim(df_train) %>%
gt()
| skim_type | skim_variable | n_missing | complete_rate | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| factor | churn | 0 | 1 | FALSE | 2 | 0: 29941, 1: 3967 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | feature_0 | 0 | 1 | NA | NA | NA | -4.157719e-03 | 0.9997759 | -2.1599941 | -0.7473845 | -0.18234062 | 0.66522518 | 5.091402 | ▅▇▃▁▁ |
| numeric | feature_1 | 0 | 1 | NA | NA | NA | 2.583825e-03 | 1.0142678 | -3.0811485 | -0.4227866 | -0.29732404 | 0.02290117 | 33.094776 | ▇▁▁▁▁ |
| numeric | feature_2 | 0 | 1 | NA | NA | NA | -2.127901e-04 | 1.0008723 | -1.7791078 | -0.9380027 | 0.02326031 | 0.62404969 | 1.825628 | ▆▅▇▅▆ |
| numeric | feature_3 | 0 | 1 | NA | NA | NA | -5.287459e-05 | 1.0025117 | -1.0024779 | -0.6025167 | -0.30351652 | 0.23623698 | 18.094700 | ▇▁▁▁▁ |
| numeric | feature_4 | 0 | 1 | NA | NA | NA | -2.980493e-04 | 1.0037235 | -0.5693506 | -0.5693506 | -0.24656035 | 0.07622994 | 19.443647 | ▇▁▁▁▁ |
| numeric | feature_5 | 0 | 1 | NA | NA | NA | -4.651946e-03 | 0.9939836 | -0.4114531 | -0.4114531 | -0.41145311 | -0.41145311 | 8.127648 | ▇▁▁▁▁ |
| numeric | feature_6 | 0 | 1 | NA | NA | NA | -7.497738e-03 | 0.8026964 | -0.2519404 | -0.2519404 | -0.25194037 | -0.25194037 | 23.625644 | ▇▁▁▁▁ |
| numeric | feature_7 | 0 | 1 | NA | NA | NA | 4.336381e+00 | 3.2733763 | 0.0000000 | 1.0000000 | 4.00000000 | 7.00000000 | 11.000000 | ▇▅▂▂▅ |
| numeric | feature_8 | 0 | 1 | NA | NA | NA | 1.171051e+00 | 0.6067304 | 0.0000000 | 1.0000000 | 1.00000000 | 2.00000000 | 2.000000 | ▂▁▇▁▃ |
| numeric | feature_9 | 0 | 1 | NA | NA | NA | 1.225345e+00 | 0.7491039 | 0.0000000 | 1.0000000 | 1.00000000 | 2.00000000 | 3.000000 | ▂▇▁▅▁ |
| numeric | feature_10 | 0 | 1 | NA | NA | NA | 1.813731e-02 | 0.1334499 | 0.0000000 | 0.0000000 | 0.00000000 | 0.00000000 | 1.000000 | ▇▁▁▁▁ |
| numeric | feature_11 | 0 | 1 | NA | NA | NA | 5.555031e-01 | 0.4969172 | 0.0000000 | 0.0000000 | 1.00000000 | 1.00000000 | 1.000000 | ▆▁▁▁▇ |
| numeric | feature_12 | 0 | 1 | NA | NA | NA | 1.596673e-01 | 0.3663027 | 0.0000000 | 0.0000000 | 0.00000000 | 0.00000000 | 1.000000 | ▇▁▁▁▂ |
| numeric | feature_13 | 0 | 1 | NA | NA | NA | 6.394066e-01 | 0.8976269 | 0.0000000 | 0.0000000 | 0.00000000 | 2.00000000 | 2.000000 | ▇▁▁▁▃ |
| numeric | feature_14 | 0 | 1 | NA | NA | NA | 5.520497e+00 | 3.0032412 | 0.0000000 | 3.0000000 | 6.00000000 | 8.00000000 | 11.000000 | ▅▂▇▇▃ |
| numeric | feature_15 | 0 | 1 | NA | NA | NA | 2.562375e+00 | 0.9871483 | 0.0000000 | 3.0000000 | 3.00000000 | 3.00000000 | 3.000000 | ▁▁▁▁▇ |
The overview shows:
Before exploring the data we perform a data split. The data exploration will only be done on the training set. This is crucial so that no information from the test set leaks into our model building phase. Although the data already comes with a 75/25 split, the provided test data will not be used for model training since it is unlabeled and better suited to simulate new data.
# Seeded split so the same split is produced every time, ensuring we always use the same data for model training
set.seed(42)
# Split the data with a 75/25 proportion in favor for the training set
data_split <-
initial_split(
df_train, # original training dataset
prop = 3/4, # 75/25 split
strata = churn # stratified on churn variable
)
# Create the dataframes for training and testing
train_data <- training(data_split)
test_data <- testing(data_split)
In order to get a better understanding of the data and underlying relations between features, we plot the data in different ways.
We create a new dataframe to avoid altering the training dataset during the data exploration.
explore_data <-
train_data
Then we change the binary values to named values representing churn, to make the plots easier to read (using yes and no instead of 1 and 0).
# Change column type to character to change values
explore_data <-
explore_data %>%
mutate(
churn = as.character(churn)
)
# Change values
explore_data$churn[explore_data$churn == "1"] <-
"yes"
explore_data$churn[explore_data$churn == "0"] <-
"no"
#Change column type back to factor
explore_data <-
explore_data %>%
mutate(
churn = as.factor(churn)
)
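As a side note, the three steps above (factor to character, recode, back to factor) could also be expressed in a single mutate() with forcats (assuming the forcats package is available):

```r
library(dplyr)
library(forcats)

# Recode the factor levels directly: "0" -> "no", "1" -> "yes"
explore_data <-
  train_data %>%
  mutate(churn = fct_recode(churn, no = "0", yes = "1"))
```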
The data is imbalanced as nearly 90 % are labeled as not churning. We will need to compensate for this before performing hyperparameter tuning.
The correlation matrix helps us see relations between the
features, as it may indicate a predictive relationship between variables
(see https://en.wikipedia.org/wiki/Correlation). It is formed
by calculating the correlation coefficient r between
each pair of features. r ranges from -1 to 1 and indicates the
strength and direction of the linear relation:
Positive values indicate that greater values of one variable correspond to
greater values of the other variable, which also holds for lower
values (similar behavior).
Negative values indicate that greater values of one variable correspond
to lesser values of the other (opposite behavior).
(See: https://en.wikipedia.org/wiki/Correlation and https://en.wikipedia.org/wiki/Covariance)
It is also useful to show the specific correlation of each feature to our churn variable:
We can see that the correlation between churn and the other features is the following:
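One way such feature-to-churn correlations could be computed (a sketch; churn is temporarily converted back to numeric so it can enter the correlation matrix):

```r
library(dplyr)

# Spearman correlation of every feature with the churn variable
cor_to_churn <-
  explore_data %>%
  mutate(churn = as.numeric(churn == "yes")) %>%
  cor(method = "spearman")

# Sorted correlation of each feature with churn
sort(cor_to_churn[, "churn"])
```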
To further inspect the relation between churn and other variables we will create plots for each variable.
TODO:
* Combine the plots into one figure with ggarrange, see: https://www.datanovia.com/en/lessons/combine-multiple-ggplots-into-a-figure/
We will start with boxplots and histograms for the float variables.
(The boxplots and histograms for the seven float variables are displayed here.)
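A sketch of how such per-feature boxplots could be generated (assuming ggplot2 and purrr are loaded; printing the resulting list yields one plot per feature):

```r
library(ggplot2)
library(purrr)

# The seven float features
float_vars <- paste0("feature_", 0:6)

# One boxplot of each float feature, grouped by churn
boxplots <-
  map(float_vars, function(var) {
    ggplot(explore_data, aes(x = churn, y = .data[[var]])) +
      geom_boxplot() +
      labs(title = var)
  })

boxplots
```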
This confirms our previous insights from the Data overview and Correlation to churn section:
Most of the variables do not have a significant influence on the churn label.
feature_3 has a noticeable impact on churn.
feature_5 & feature_6 can be interesting candidates for model training. This will be tested in the Model section.
There are a lot of outliers present for almost every feature. This can be seen in the histograms as well as the boxplots (note that features 1, 4, 5 & 6 have been filtered to better display the respective boxplots).
The scales for float variables are very different.
Next up are the boxplots and histograms for the integer variables.
(The boxplots and histograms for the nine integer variables are displayed here.)
Again, this is a confirmation of what we have seen before:
feature_11 has a noticeable impact on the churn variable.
feature_13 can also be an interesting candidate for model training. This will be tested in the Model section.
feature_7 to feature_15 contain discrete integer values at different levels; feature_10 to feature_12, for example, contain binary values (0 and 1).
The scales of the integer variables are also different (although not to the same extent as for the float variables).
As a conclusion, we will train our model with feature_3, feature_5, feature_6, feature_11 and feature_13.
Before model training we will:
Select the outcome and predictors for our model
df_train_new <-
df_train %>%
select(feature_3,
feature_5,
feature_6,
feature_11,
feature_13,
churn)
The data split will be done once again on df_train_new so that it only contains the relevant features.
# Seeded split so the same split is produced every time, ensuring we always use the same data for model training
set.seed(42)
# Split the data with a 75/25 proportion in favor for the training set
data_split <-
initial_split(
df_train_new, # updated data
prop = 3/4, # 75/25 split
strata = churn # stratified on churn variable
)
# Create the dataframes for training and testing
train_data <- training(data_split)
test_data <- testing(data_split)
For the validation set, a k-fold cross-validation with 10 validation folds will be used. The sample is also stratified, which makes sure that every characteristic of the data is properly represented in each fold. The validation set is used to check the model's performance on the training dataset.
# Fix random generation, also see data split
set.seed(42)
# generate the cross validation set
cv_folds <-
vfold_cv(train_data,
v=10, # number of folds
strata = churn, # ensure the same proportion of the churn variable in every fold
breaks = 4)
First we create a recipe for our model that will apply the same steps to all data that we feed into our model.
# Create a recipe for our model
churn_rec <-
recipe(
# outcome ~ predictor
churn ~ .,
data = train_data) %>%
# perform z-standardization: normalize the numeric variables to have a standard deviation of one and a mean of zero.
step_normalize(all_numeric(), -all_outcomes()) %>%
# removes numeric variables that have no variance.
step_zv(all_numeric(), -all_outcomes()) %>%
# remove predictor variables that have large correlations with other predictor variables.
step_corr(all_predictors(), threshold = 0.7, method = "spearman")
These variables with respective roles will be used by our recipe.
summary(churn_rec)
We check how the recipe will affect our training data:
prepped_data <-
churn_rec %>% # use the recipe object
prep() %>% # apply recipe
juice() # show the processed data
glimpse(prepped_data)
## Rows: 25,430
## Columns: 5
## $ feature_3 <dbl> -0.98232820, -0.70598997, -0.09882429, -0.32456537, 1.01820~
## $ feature_6 <dbl> -0.3003303, -0.3003303, -0.3003303, -0.3003303, -0.3003303,~
## $ feature_11 <dbl> 0.8951775, -1.1170530, -1.1170530, 0.8951775, 0.8951775, 0.~
## $ feature_13 <dbl> -0.7135851, -0.7135851, -0.7135851, -0.7135851, -0.7135851,~
## $ churn <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
prepped_data %>%
select(-churn) %>%
ggscatmat(corMethod = "spearman",
alpha=0.2)
The recipe is prepared such that the churn variable serves as the outcome the model should predict. We can see that feature_5 has been removed. This was done by the step_corr() function in the recipe, meaning feature_5 correlates strongly with another feature and does not provide additional useful information for model training.
We will set up the model with the needed specification.
xgb_spec <-
# uses empirically best parameters, if none are specified
boost_tree() %>%
# model engine
set_engine("xgboost") %>%
# model mode
set_mode("classification")
# show model specification
xgb_spec
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
Next we create a workflow object to combine the recipe churn_rec with our model in Model execution.
# workflow pipeline
xgb_wflow <-
workflow() %>%
# specify our recipe
add_recipe(churn_rec) %>%
# and our model
add_model(xgb_spec)
# show workflow
xgb_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_normalize()
## * step_zv()
## * step_corr()
##
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
##
## Computational engine: xgboost
Finally we can execute our model and fit it to our training data. We save the predictions made so we can evaluate the performance metrics, especially the specificity.
# Create fit to the model and check with validation folds
xgb_res <-
xgb_wflow %>%
fit_resamples(
# Estimate performance for specified metrics on all folds by fitting model to each resample while holding out a portion from each resample to evaluate.
resamples = cv_folds,
# Define metrics
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
# Save predictions
control = control_resamples(save_pred = TRUE)
)
We can evaluate our metrics with collect_metrics().
The performance over all folds is:
xgb_res %>%
collect_metrics(summarize = TRUE)
Our model's overall performance seems to be good. The only problem is that the specificity is low.
We then collect our predictions. These will be needed to create the confusion matrix and the ROC curve.
xgb_pred <-
xgb_res %>%
collect_predictions()
Next we create the confusion matrix and plot it.
xgb_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot(type = "heatmap") +
labs(
title = "Confusion matrix",
subtitle = "XGB model",
x = "Truth",
y = "Prediction"
)
Overall our model has a high rate of true positives but (as we have already seen) a low specificity, i.e. a low rate of true negatives.
We also visualize the ROC:
xgb_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot() +
labs(
title = "ROC curve",
subtitle = "XGB model",
x = "False positive rate (1- specificity)",
y = "True positive rate (sensitivity)",
color = "Folds")
The ROC curve looks really good, but we need to remember that our dataset is still imbalanced.
In order to find a better model we will train and evaluate some other classification algorithms and compare all models using the F1-Score as a performance measure.
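The F1-Score is not part of the metric set defined above, but it can be derived from the saved predictions with yardstick's f_meas(); a minimal sketch for the xgboost predictions collected earlier:

```r
library(yardstick)

# F1 = 2 * precision * recall / (precision + recall)
xgb_pred %>%
  f_meas(churn, .pred_class)
```

Alternatively, f_meas could simply be added to the metric_set() passed to fit_resamples().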
We will use the following algorithms:
We repeat all our steps from above to create specifications, workflows and train the models.
# Logistic regression
log_spec <-
# will use the best empirical hyperparameters
logistic_reg() %>%
set_engine(engine = "glm") %>%
set_mode("classification")
# Random forest
rf_spec <-
# will use the best empirical hyperparameters
rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
# K-nearest neighbor
knn_spec <-
# will use the best empirical hyperparameters
nearest_neighbor(neighbors = 4) %>% # can be adjusted
set_engine("kknn") %>%
set_mode("classification")
# Logistic regression
log_wflow <-
workflow() %>%
add_recipe(churn_rec) %>%
add_model(log_spec)
# Random forest
rf_wflow <-
workflow() %>%
add_recipe(churn_rec) %>%
add_model(rf_spec)
# K-nearest neighbor
knn_wflow <-
workflow() %>%
add_recipe(churn_rec) %>%
add_model(knn_spec)
# Logistic regression
log_res <-
log_wflow %>%
fit_resamples(
resamples = cv_folds,
# Define metrics
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
# Save predictions
control = control_resamples(save_pred = TRUE)
)
# Random forest
rf_res <-
rf_wflow %>%
fit_resamples(
resamples = cv_folds,
# Define metrics
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
# Save predictions
control = control_resamples(save_pred = TRUE)
)
# K-nearest neighbor
knn_res <-
knn_wflow %>%
fit_resamples(
resamples = cv_folds,
# Define metrics
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
# Save predictions
control = control_resamples(save_pred = TRUE)
)
We compare the models by first looking at the different metrics and the confusion matrices.
log_res %>% collect_metrics(summarize = TRUE)
log_pred <-
log_res %>%
collect_predictions()
log_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot("heatmap")+
labs(
title = "Confusion matrix",
subtitle = "LOG model",
x = "Truth",
y = "Prediction"
)
log_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot() +
labs(
title = "ROC curve",
subtitle = "LOG model",
x = "False positive rate (1- specificity)",
y = "True positive rate (sensitivity)",
color = "Folds")
rf_res %>% collect_metrics(summarize = TRUE)
rf_pred <-
rf_res %>%
collect_predictions()
rf_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot(type = "heatmap")+
labs(
title = "Confusion matrix",
subtitle = "RF model",
x = "Truth",
y = "Prediction"
)
rf_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot() +
labs(
title = "ROC curve",
subtitle = "RF model",
x = "False positive rate (1- specificity)",
y = "True positive rate (sensitivity)",
color = "Folds")
knn_res %>% collect_metrics(summarize = TRUE)
knn_pred <-
knn_res %>%
collect_predictions()
knn_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot(type = "heatmap")+
labs(
title = "Confusion matrix",
subtitle = "KNN model",
x = "Truth",
y = "Prediction"
)
knn_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot() +
labs(
title = "ROC curve",
subtitle = "KNN model",
x = "False positive rate (1- specificity)",
y = "True positive rate (sensitivity)",
color = "Folds")
xgb_res %>% collect_metrics(summarize = TRUE)
xgb_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot("heatmap") +
labs(
title = "Confusion matrix",
subtitle = "XGB model",
x = "Truth",
y = "Prediction"
)
xgb_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot() +
labs(
title = "ROC curve",
subtitle = "XGB model",
x = "False positive rate (1- specificity)",
y = "True positive rate (sensitivity)",
color = "Folds")
We can see that our xgboost classification algorithm performed best out of all models so far, considering specificity and the F1-Score.
We will use this model and fine-tune it in order to gain a better specificity and thus better predictions of churn.
To adjust for our imbalanced data, we will use the synthetic minority oversampling technique smote(). This creates synthetic samples of the minority class (in our case the data labeled as churning) and additionally undersamples the majority class (labeled as not churning). This is needed to improve the performance of our model, as it would otherwise not see enough churn examples to correctly predict the outcomes.
smote_data <-
# Oversample the minority label (churn) and undersample the majority label (no churn)
smote(churn ~., train_data, perc.over = 3, perc.under = 2)
table(smote_data$churn)
##
## 0 1
## 17850 11900
The data is still imbalanced, but not as much as before.
We will also need to create a new evaluation set with the adjusted dataset.
# Fix random generation, also see data split
set.seed(42)
# generate the cross validation set
cv_folds <-
vfold_cv(smote_data,
v=5, # number of folds
strata = churn, # ensure the same proportion of the churn variable in every fold
breaks = 4)
Now all we have to do is perform hyperparameter tuning on our xgboost algorithm to select the values that maximize specificity.
We have to specify the hyperparameters that we want to tune. We also adjust for the remaining imbalance by tuning the scale_pos_weight hyperparameter.
xgb_tuned_spec <-
boost_tree(
mtry = tune(), # number (or proportion) of predictors
trees = 50, # number of decision trees. For testing, we will leave it at a lower value
min_n = tune(), # minimum number of data points in a node required to split
tree_depth = tune(), # maximum splits (depth of decision trees)
learn_rate = tune(), # adaption rate over iterations
loss_reduction = tune(), # reduction in loss required to split
sample_size = tune(), # number (or proportion) of data exposed to fitting
) %>%
# model engine
# specify scale_pos_weight to account for the imbalanced dataset
set_engine("xgboost", scale_pos_weight = tune()) %>%
# model mode
set_mode("classification")
# show model specification
xgb_tuned_spec
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## scale_pos_weight = tune()
##
## Computational engine: xgboost
Next we create a new workflow object.
# workflow pipeline
xgb_tuned_wflow <-
workflow() %>%
# specify our recipe
add_recipe(churn_rec) %>%
# and our model
add_model(xgb_tuned_spec)
# show workflow
xgb_tuned_wflow
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_normalize()
## * step_zv()
## * step_corr()
##
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 50
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## scale_pos_weight = tune()
##
## Computational engine: xgboost
In order to tune the parameters we first set up possible values for
them. For efficiency reasons we use grid_latin_hypercube().
For more information see:
https://htmlpreview.github.io/?https://github.com/kirenz/tidymodels-in-r/blob/main/05-tidymodels-xgboost-tuning.html
xgb_grid <- grid_latin_hypercube(
scale_pos_weight(),
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(), # needs to be a proportion
finalize(mtry(), smote_data), # different approach since mtry depends on number of predictors in data
learn_rate(),
size = 50
)
xgb_grid
The tune_grid() function computes the performance metrics we defined earlier.
# Seeded to make sure we can reproduce the tuning
set.seed(42)
xgb_tuned_res <- tune_grid(
xgb_tuned_wflow,
resamples = cv_folds,
grid = xgb_grid,
# save the predictions to visualize the data
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
control = control_grid(save_pred = TRUE)
)
metrics <-
xgb_tuned_res %>%
collect_metrics(summarize = TRUE)
We visualize the specificity for every hyperparameter, which gives us a first impression of how well specific values of our hyperparameters perform.
# Credits for the code chunk goes to:
# https://htmlpreview.github.io/?https://github.com/kirenz/tidymodels-in-r/blob/main/05-tidymodels-xgboost-tuning.html
# Pivot xgb_tuned_res to longer data format while only filtering for spec
xgb_res_spec <-
xgb_tuned_res %>%
collect_metrics() %>%
filter(.metric == "spec") %>%
select(mean, mtry:sample_size) %>%
pivot_longer(mtry:sample_size,
values_to = "value",
names_to = "parameter"
)
# show specificity side by side for every hyperparameter
xgb_res_spec %>%
ggplot(aes(value, mean, color = parameter)) +
geom_point(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~parameter, scales = "free_x") +
scale_color_viridis_d()+
labs(x = NULL,
y = "Specificity",
title = "Different values of specificity",
subtitle = "For each hyperparameter after tuning")
In order to train our model with the best hyperparameters, we evaluate them based on the specificity values depicted in the previous graph.
show_best(xgb_tuned_res,
metric = "spec")
best_spec <-
select_best(
xgb_tuned_res,
metric = "spec"
)
best_spec
This set of values will be chosen for our hyperparameters as it overall provides the best specificity values.
We create a final workflow which uses our best set of tuned hyperparameters.
final_xgb_Wf <- finalize_workflow(
xgb_tuned_wflow,
best_spec
)
final_xgb_Wf
## == Workflow ====================================================================
## Preprocessor: Recipe
## Model: boost_tree()
##
## -- Preprocessor ----------------------------------------------------------------
## 3 Recipe Steps
##
## * step_normalize()
## * step_zv()
## * step_corr()
##
## -- Model -----------------------------------------------------------------------
## Boosted Tree Model Specification (classification)
##
## Main Arguments:
## mtry = 4
## trees = 50
## min_n = 22
## tree_depth = 4
## learn_rate = 0.000355576439728219
## loss_reduction = 0.00168145148229085
## sample_size = 0.628655545041431
##
## Engine-Specific Arguments:
## scale_pos_weight = 0.828226032601669
##
## Computational engine: xgboost
Fit the model with our finalized workflow.
xgb_tuned_res <-
final_xgb_Wf %>%
fit_resamples(
resamples = cv_folds,
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
),
# Save predictions
control = control_resamples(save_pred = TRUE)
)
Evaluate the metrics over all folds and show the confusion matrix and ROC curve.
xgb_tuned_res %>%
collect_metrics(summarize = TRUE)
xgb_tuned_pred <-
xgb_tuned_res %>%
collect_predictions()
xgb_tuned_pred %>%
conf_mat(churn, .pred_class) %>%
autoplot(type = "heatmap")
xgb_tuned_pred %>%
group_by(id) %>% # id contains our folds
roc_curve(churn, .pred_0) %>%
autoplot()
We can see that our tuned model outperformed every other model regarding specificity. It is still lower than we would like, but since we have already tuned the hyperparameters, we will use this model on the test data. In order to improve the performance further, a custom performance metric could be created.
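One way such a custom metric could be built (a hedged sketch, not part of the original analysis) is with yardstick's metric_tweak(), which derives a new metric from an existing one with modified defaults — for example a specificity variant that treats the churn level as the event of interest:

```r
library(yardstick)

# Hypothetical sketch: derive a specificity variant that treats the
# second factor level (churn = 1) as the event of interest
spec_churn <- metric_tweak("spec_churn", spec, event_level = "second")

# The tweaked metric can be used inside metric_set() just like spec,
# e.g. when calling tune_grid() or fit_resamples()
churn_metrics <- metric_set(spec_churn, accuracy, roc_auc)
```

The names spec_churn and churn_metrics are illustrative; any metric derived this way behaves like a regular yardstick metric.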
We can now perform our last fit on the test data we split off earlier.
# perform last fit of our model on test data
last_fit_xgb <-
last_fit(final_xgb_Wf,
split = data_split,
metrics = metric_set(
spec,
accuracy,
roc_auc,
recall,
precision
)
)
To see how our model performed, we check the performance metrics.
last_fit_xgb %>%
collect_metrics()
last_fit_xgb %>%
collect_predictions() %>%
conf_mat(churn, .pred_class) %>%
autoplot(type = "heatmap")
The specificity is at a low value, which means our model did not predict the churn label well on the test data. One reason could be that the training data does not contain enough samples labeled as churn. Normally we would tackle this by training another model and improving our feature engineering to obtain better predictors.
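Another lever that would not require retraining (an illustrative sketch with an arbitrary cutoff of 0.3, not something done in this report) is lowering the default 0.5 probability threshold, so that more borderline contracts get classified as churn:

```r
library(dplyr)
library(yardstick)

# Hypothetical sketch: collect the test-set predictions and classify a
# contract as churn whenever the predicted churn probability exceeds
# 0.3 instead of the default 0.5, trading some accuracy for better
# recognition of the minority churn class
test_pred <-
  last_fit_xgb %>%
  collect_predictions() %>%
  mutate(.pred_class_adj = factor(if_else(.pred_1 > 0.3, "1", "0"),
                                  levels = levels(churn)))

# Recompute the specificity with the adjusted class labels
test_pred %>%
  spec(truth = churn, estimate = .pred_class_adj)
```

In practice the cutoff would be chosen from the ROC curve or by sweeping thresholds on the resamples rather than picking 0.3 by hand.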
vip_features <-
last_fit_xgb %>%
pluck(".workflow", 1) %>%
extract_fit_parsnip() %>%
vip(num_features = 10)
ggplotly(vip_features)
The most important feature for the prediction was feature_3. The others have a low importance, which is another indicator that feature engineering could lead to better results on the test data.
Even if the model predicts only about 30% of the actual churning customers, we can still use it for our use case, since it has a very high rate of correctly predicting customers that won’t churn. This means it is still better than random guessing.
We will now use our model on new data that has not been labeled yet: the second dataset provided by kaggle, which we stored in our new_data dataframe.
In order to make predictions, we need to create an object of class model_fit from our final_xgb_Wf. This is done using the fit() function.
# create final model
final_model <-
fit(final_xgb_Wf, df_train_new)
## [14:53:02] WARNING: amalgamation/../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
# create predictions
predictions <-
predict(final_model, new_data)
# create new dataframe
new_prediction_data <-
new_data
# add predictions to new_data_prediction
new_prediction_data$churn_prediction <-
predictions$.pred_class
# put predictions as first column
new_prediction_data <-
new_prediction_data %>%
select(churn_prediction, everything())
We change the prediction labels to better represent the churn.
new_prediction_data <-
new_prediction_data %>%
mutate(
churn_prediction = as.character(churn_prediction)
)
new_prediction_data$churn_prediction[new_prediction_data$churn_prediction == 0] <- "no"
new_prediction_data$churn_prediction[new_prediction_data$churn_prediction == 1] <- "yes"
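As an aside, the two assignments above can be written as a single step with dplyr::recode(). This is an equivalent alternative, not an additional transformation (recode() leaves values without a match unchanged):

```r
library(dplyr)

# Equivalent one-step relabeling of the prediction column
new_prediction_data <-
  new_prediction_data %>%
  mutate(churn_prediction = recode(as.character(churn_prediction),
                                   "0" = "no",
                                   "1" = "yes"))
```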
Let’s take a look at the transformed data.
datatable(new_prediction_data,
extensions = list(
'Buttons' = NULL,
"Responsive" = NULL
),
options = list(
dom = 'Bfrtip',
buttons = c( 'csv', 'excel', 'pdf')
)
)
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
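The warning above suggests server-side processing for large tables. In a Shiny dashboard this could look roughly like the following sketch (the app structure and output name are assumptions, not part of this report's dashboard code):

```r
library(shiny)
library(DT)

ui <- fluidPage(
  DTOutput("predictions")
)

server <- function(input, output, session) {
  # server = TRUE pages rows on the server instead of sending
  # the full table to the browser at once
  output$predictions <- renderDT(new_prediction_data, server = TRUE)
}

# shinyApp(ui, server)
```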
Next, we recreate the pie chart for our predictions.
new_prediction_data %>%
count(churn_prediction, name ="churn_total") %>%
mutate(percent = churn_total/sum(churn_total)*100,
percent = round(percent, 2)) %>%
ggplot(
aes(x="",
y=percent,
fill=churn_prediction)
) +
geom_bar(
stat="identity",
width=1
) +
coord_polar("y") +
theme_classic() +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
) +
scale_fill_manual(
values=c("#ffdb58", "#bcd4e6")
) +
labs(
x = NULL,
y = NULL,
fill = NULL,
title = "Predicted customer churn distribution",
subtitle = "For insurance contracts") +
geom_text(
aes(label = paste0(percent, "%")),
position = position_stack(vjust=0.5)
)
new_prediction_data %>%
count(churn_prediction, name ="churn_total") %>%
mutate(percent = churn_total/sum(churn_total)*100,
percent = round(percent, 2))
We then save the predictions on the new data, our pie chart, and the plotly graph of our vip features to our Dashboard folder in order to use them in our dashboard.
write.csv(new_prediction_data, "Dashboard\\Results.csv", row.names=FALSE)
fig <-
ggplotly(vip_features)
htmlwidgets::saveWidget(fig,
file="Dashboard\\insurance-churn-dashboard\\www\\plotly.html",
selfcontained = TRUE)
# Save the pie chart with ggsave() since it is a ggplot object
pie_chart <-
new_prediction_data %>%
count(churn_prediction, name ="churn_total") %>%
mutate(percent = churn_total/sum(churn_total)*100,
percent = round(percent, 2)) %>%
ggplot(
aes(x="",
y=percent,
fill=churn_prediction)
) +
geom_bar(
stat="identity",
width=1
) +
coord_polar("y") +
theme_classic() +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()
) +
scale_fill_manual(
values=c("#ffdb58", "#bcd4e6")
) +
labs(
x = NULL,
y = NULL,
fill = NULL,
title = "Predicted customer churn distribution",
subtitle = "For insurance contracts") +
geom_text(
aes(label = paste0(percent, "%")),
position = position_stack(vjust=0.5)
)
ggsave("Dashboard\\insurance-churn-dashboard\\www\\pie-chart.png", pie_chart)
This report shows how to build a classification model for anonymized data. We have seen that our model still has room for improvement; additional feature engineering and less imbalanced training data could yield better results.
Since the data provided was anonymized, you will find some additional information below to give you an overview of potential real-world features that impact the churn rate of insurance customers.
For example the … TODO: research and describe relevant features
## Relevant features